Fingerprinting walking in NHANES

1: Apply ADEPT to all subjects in NHANES

2: Take all seconds that ADEPT identifies as containing steps

3: Filter to find some consecutive seconds of walking

4: Calculate “fingerprints” from these seconds

5: Predict identities

6: Associate fingerprints with scalars (age, sex, mortality risk, etc.)

Distribution of walking

A bout is considered walking if there are no more than two seconds between seconds identified as walking and the bout lasts at least 10 seconds. We can look at the distribution of bout lengths (in seconds):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.   99.5%
  10.00   11.00   14.00   19.12   21.00 1387.00  108.00
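The bout definition above can be sketched as follows. This is an illustrative helper, not the analysis code; `walk_sec`, `find_bouts`, and the thresholds are hypothetical names.

```r
# Sketch of the bout definition (hypothetical helper, not the analysis
# code): `walk_sec` is a logical vector with one entry per second,
# TRUE if ADEPT identified steps in that second.
find_bouts = function(walk_sec, max_gap = 2, min_len = 10) {
  idx = which(walk_sec)
  if (length(idx) == 0) return(integer(0))
  # start a new bout when the gap between walking seconds exceeds max_gap
  bout_id = cumsum(c(1, diff(idx) > max_gap + 1))
  # a bout spans from its first to its last walking second
  lens = tapply(idx, bout_id, function(x) max(x) - min(x) + 1)
  as.integer(lens[lens >= min_len])
}

walk = rep(FALSE, 60)
walk[c(1:5, 8:20, 40:45)] = TRUE  # 2-second gap is bridged; 6 s bout dropped
find_bouts(walk)                  # 20
```

A 2-second gap (seconds 6-7) is bridged into a single 20-second bout, while the 6-second run at seconds 40-45 falls below the 10-second minimum and is dropped.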

We can also look at the distribution of the maximum bout length per person.

We can also look at the distribution of total walking time:

We will impose some cutoff for number of minutes of walking needed to be included in the analysis. We can examine the effect of different cutoffs:

For reference, the 99th percentile of total walking time is 13,319.1 seconds.

cutoff (minutes)   % of individuals retained
0.5                97
1.0                94
2.0                89
3.0                85
4.0                81
5.0                78

Based on this, a cutoff of 180 seconds (3 minutes) seems reasonable: we retain 85% of individuals.
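The cutoff table can be sketched as follows, assuming a hypothetical vector `total_walk` holding each person's total seconds of walking:

```r
# Sketch: percent of individuals retained at each candidate cutoff
# (`total_walk` is a hypothetical per-person vector of walking seconds)
retained = function(total_walk, cutoffs_min = c(0.5, 1, 2, 3, 4, 5)) {
  data.frame(
    cutoff = cutoffs_min,
    pct = sapply(cutoffs_min * 60, function(ct) 100 * mean(total_walk >= ct))
  )
}
```

Retention is necessarily non-increasing in the cutoff, which matches the table above.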

We can check whether the excluded differ from the included in any systematic way:

Characteristic               Included, N = 12,235 [1]   Excluded, N = 2,000 [1]
age_in_years_at_screening    33 (15, 52)                27 (7, 67)
gender
    Female                   6,173 (50%)                1,064 (53%)
    Male                     6,062 (50%)                936 (47%)
bin_mobilityproblem          1,013 (13%)                605 (57%)
total_scsslsteps             9,205 (6,897, 11,937)      5,494 (3,146, 7,930)
num_valid_days
    0                        707 (5.8%)                 351 (18%)
    1                        360 (2.9%)                 115 (5.8%)
    2                        402 (3.3%)                 70 (3.5%)
    3                        445 (3.6%)                 80 (4.0%)
    4                        569 (4.7%)                 76 (3.8%)
    5                        929 (7.6%)                 122 (6.1%)
    6                        1,884 (15%)                251 (13%)
    7                        6,939 (57%)                935 (47%)
total_PAXMTSM                14,796 (12,083, 17,836)    13,862 (9,457, 19,343)
general_health_condition
    Poor                     230 (1.9%)                 108 (5.4%)
    Fair                     1,702 (14%)                391 (20%)
    Good                     4,604 (38%)                627 (31%)
    Very good                3,653 (30%)                429 (21%)
    Excellent                2,046 (17%)                445 (22%)

[1] Median (Q1, Q3); n (%)

It seems that those excluded are younger, take fewer steps, are more likely to have a mobility problem, and have fewer valid days of accelerometry data.

Check some walking segments

Obtaining fingerprints

Once we have the walking segments, we can calculate fingerprints. We only use individuals with at least 180 seconds of walking. To ensure each person is equally represented in the sample, we randomly select 180 seconds from each person and calculate the fingerprints for those seconds. We use a grid cell size of 0.25\(g\) and lags of 12, 24, and 36 samples which, at the 80 Hz sampling rate, correspond to 0.15, 0.30, and 0.45 seconds, respectively.

For each individual, we have 180 observations of \(144 \times 3 = 432\) potential predictors (144 grid cells per lag, times three lags).
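The fingerprint for a single second can be sketched like this. `fingerprint_second` is a hypothetical helper, and the signal range of [0, 3)\(g\) (giving \(12 \times 12 = 144\) cells per lag) and the 80 Hz sampling rate are assumptions:

```r
# Sketch of a grid-cell fingerprint for one second of data (hypothetical
# helper): counts pairs (x_t, x_{t+lag}) in 0.25 g x 0.25 g cells,
# assuming the samples in `x` lie in [0, 3) g
fingerprint_second = function(x, lag, cell = 0.25, max_g = 3) {
  n_cells = max_g / cell                      # 12 cells per axis
  u = x[seq_len(length(x) - lag)]             # signal
  v = x[(lag + 1):length(x)]                  # lag signal
  i = pmin(floor(u / cell), n_cells - 1) + 1  # row index of each pair
  j = pmin(floor(v / cell), n_cells - 1) + 1  # column index of each pair
  tab = matrix(0L, n_cells, n_cells)
  for (k in seq_along(i)) tab[i[k], j[k]] = tab[i[k], j[k]] + 1L
  tab
}

# one second at 80 Hz with a 12-sample lag yields 80 - 12 = 68 pairs
x = runif(80, 0, 3)
sum(fingerprint_second(x, lag = 12))  # 68
```

Flattening the three 12 × 12 matrices (lags 12, 24, 36) gives the 432 predictors per observation.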

Model fitting

To fit models, we employ one-vs-rest logistic regression. First, we remove predictors with near-zero variance as follows:

Code
library(tidyverse)
library(recipes)
library(viridis)

# recipe that flags predictors with near-zero variance
nzv_trans =
  recipe(id ~ ., data = cells) %>%
  step_nzv(all_predictors())

nzv_estimates = prep(nzv_trans)

# names of the grid-cell predictors retained by the filter
nzv = setdiff(colnames(juice(nzv_estimates)), "id")

# plot the mean value of each retained grid cell across all subjects
cells %>% 
  select(id, all_of(nzv)) %>% 
  summarize(across(-id, mean)) %>% 
  pivot_longer(cols = everything()) %>% 
  mutate(lag = sub(".*\\_", "", name),          # lag suffix, e.g. "12"
         name = sub("_[^_]*$", "", name)) %>%   # cell name without the lag
  right_join(df, by = c("name" = "clean_names")) %>% 
  filter(!is.na(lag)) %>%
  ggplot(aes(x = cut_sig, y = cut_lagsig, fill = value)) +
  facet_grid(. ~ lag) +
  geom_tile(col = "black") + 
  scale_fill_viridis(name = "Mean value across all subjects") +
  labs(x = "Signal", y = "Lag Signal", title = "Most frequent grid cells") + 
  theme(axis.text.x = element_text(angle = 45),
        legend.position = "bottom") 

We then perform the regression. Because we have so many subjects, we perform the regressions in folds of increasing size, to determine how accuracy changes with the number of subjects in the models.
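The one-vs-rest scheme for a single fold can be sketched as follows; `fit_ovr` is a hypothetical helper, not the pipeline actually used:

```r
# Sketch of one-vs-rest logistic regression for one fold (hypothetical
# helper): one binomial GLM per subject, and the predicted identity is
# the subject whose model assigns the highest probability
fit_ovr = function(train, test) {
  ids = unique(train$id)
  probs = sapply(ids, function(s) {
    y = as.integer(train$id == s)             # this subject vs. everyone else
    fit = glm(y ~ ., data = subset(train, select = -id), family = binomial)
    predict(fit, newdata = subset(test, select = -id), type = "response")
  })
  ids[max.col(probs)]                         # highest-probability subject
}
```

Each fold's accuracy is then the fraction of test seconds assigned to the correct subject.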

Prediction results

We see that accuracy decreases as the number of subjects per fold increases. We can try boosting models, or oversampling, where we sample 100x (with replacement) from the individual for whom the model is being fit.
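The oversampling variant can be sketched as follows (hypothetical helper; `train` has a subject `id` column as above):

```r
# Sketch: oversample the target subject 100x (with replacement) before
# fitting that subject's one-vs-rest model (hypothetical helper)
oversample = function(train, s, times = 100) {
  pos = train[train$id == s, ]
  extra = pos[sample(nrow(pos), nrow(pos) * (times - 1), replace = TRUE), ]
  rbind(train, extra)
}
```

This leaves the target subject represented `times` as often as in the original data, counteracting the extreme class imbalance of one subject vs. the rest of the fold.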

Additional scenario: a subject is in the training data but not in the test data. We vary the test set to contain different proportions of the training subjects.

The reverse scenario is a subject who is in the test data but not in the training data; here we want a diagnostic, such as a confidence measure, indicating that this person may not be in the training set at all.

The probabilities here represent either the maximum probability assigned to an impostor or the probability that an individual is the true individual. The following is for 1,000-person folds.

Association of fingerprints with covariates

Next, we can regress the fingerprints on different covariates.

First, we remove grid cells with near zero variance. Then, we calculate for each subject the proportion of time spent in each grid cell. Finally, we perform separate regressions for each grid cell:

\[\text{time in cell}_i = \beta_0 + \beta_1\,\text{sex}_i\]

\[\text{time in cell}_i = \beta_0 + \beta_1\,\text{age in years}_i\]

\[\text{time in cell}_i = \beta_0 + \beta_1\,\text{mortality at 5 years}_i\]

We do this for each cell, then plot the results, adjusting the p-values for multiple comparisons using the Bonferroni correction.
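The per-cell regressions can be sketched as follows. `cell_assoc` and `props` are hypothetical names; `props` is assumed to hold one row per subject with the proportion of walking time in each cell plus the covariate column:

```r
# Sketch: one linear regression per grid cell with Bonferroni adjustment
# (hypothetical helper): `props` has one row per subject, with cell
# proportions in the remaining columns and the covariate named by `covariate`
cell_assoc = function(props, covariate) {
  cells = setdiff(names(props), covariate)
  pvals = sapply(cells, function(cl) {
    fit = lm(reformulate(covariate, response = cl), data = props)
    summary(fit)$coefficients[2, "Pr(>|t|)"]  # p-value for the covariate
  })
  data.frame(cell = cells, p = pvals,
             p_bonf = p.adjust(pvals, method = "bonferroni"))
}
```

Bonferroni multiplies each p-value by the number of cells tested (capping at 1), so `p_bonf` is always at least as large as `p`.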

We can use finer grid cells to see whether any more fine-grained effects emerge.